11.2.1 Exercises

  1. What happens if you repeat the above analysis with all diamonds? (Not just all diamonds with two or fewer carats). What does the strange geometry of log(carat) vs relative price represent? What does the diagonal line without any points represent?

Consider the linear relationship between lcarat and lprice:

diamonds2 <- diamonds %>% 
  mutate(
    lcarat = log2(carat),
    lprice = log2(price)
  )
diamonds2
## # A tibble: 53,940 x 12
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows, and 2 more variables: lcarat <dbl>,
## #   lprice <dbl>
ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_bin2d() +
  geom_smooth(method = "lm", se = FALSE, size = 2, color = "yellow")

We can see above that the x-axis has been extended to the right compared to the original plot (since the original plot was limited to 2 or fewer carats). The range of lprice did not increase.

mod <- lm(lprice ~ lcarat, data=diamonds2)
diamonds2 <- diamonds2 %>% mutate(rel_price = resid(mod))
diamonds2
## # A tibble: 53,940 x 13
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows, and 3 more variables: lcarat <dbl>,
## #   lprice <dbl>, rel_price <dbl>
ggplot(diamonds2, aes(carat, rel_price)) + geom_bin2d()

The above plot reveals a problem: there is a distinct pattern in the residuals. The linear model overpredicts the log price for large carats, and the variance of the residuals is noticeably higher for small carats. Looking back at the plot of the linear model, the fitted line extends above the observed range of lprice at the right edge, so only points below the line remain there (which is why we see negative residuals).
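A quick way to quantify both problems (a sketch, assuming `diamonds2` and `rel_price` from the chunks above): bin carat and summarize the residuals per bin. The larger bins should show negative means, and the smaller bins a larger spread.

```r
diamonds2 %>%
  mutate(size = cut(carat, c(0, 1, 2, 3, Inf))) %>%
  group_by(size) %>%
  summarize(
    mean_resid = mean(rel_price),  # negative = model overpredicts in this bin
    sd_resid = sd(rel_price)       # spread of residuals within this bin
  )
```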

color_cut <- diamonds2 %>%
  group_by(color, cut) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
color_cut
## # A tibble: 35 x 4
## # Groups:   color [?]
##    color       cut    price   rel_price
##    <ord>     <ord>    <dbl>       <dbl>
##  1     D      Fair 4291.061 -0.06695865
##  2     D      Good 3405.382 -0.03902942
##  3     D Very Good 3470.467  0.10763401
##  4     D   Premium 3631.293  0.11368869
##  5     D     Ideal 2629.095  0.21605978
##  6     E      Fair 3682.312 -0.16154694
##  7     E      Good 3423.644 -0.04676044
##  8     E Very Good 3214.652  0.06901578
##  9     E   Premium 3538.914  0.08742779
## 10     E     Ideal 2597.550  0.17323162
## # ... with 25 more rows
ggplot(color_cut, aes(color, price)) +
  geom_line(aes(group=cut), color="grey80") + 
  geom_point(aes(color = cut))

Above we see the same trend as in the book: the lowest-quality diamonds (color = J) have the highest price, especially for Premium cut diamonds. Plotting relative price against color instead resolves the issue.

ggplot(color_cut, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") + 
  geom_point(aes(color=cut))

  1. I made an unsupported assertion that lower-quality diamonds tend to be larger. Support my claim with a plot.
ggplot(diamonds, aes(color, carat)) + geom_boxplot()
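The claim is about quality generally, so it is worth checking the other two quality measures as well (the same idea, sketched for cut and clarity):

```r
ggplot(diamonds, aes(cut, carat)) + geom_boxplot()
ggplot(diamonds, aes(clarity, carat)) + geom_boxplot()
```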

  1. Can you create a plot that simultaneously shows the effect of color, cut, and clarity on relative price? If there’s too much information to show on one plot, think about how you might create a sequence of plots to convey the same message.
color_cut_clarity <- diamonds2 %>%
  group_by(color, cut, clarity) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
ggplot(color_cut_clarity, aes(color,rel_price)) + 
  geom_line(aes(group=cut), color="grey80") + 
  geom_point(aes(color=cut)) + 
  facet_wrap(~clarity)

  1. How do depth and table relate to the relative price?

Table and depth do not have much relationship with relative price. This makes sense, as table and depth are highly correlated with carat (weight), and we’ve already taken out the effect of carat.

ggplot(diamonds2, aes(depth, rel_price)) +
  geom_bin2d()

ggplot(diamonds2, aes(table, rel_price)) +
  geom_bin2d()

11.3.1 Exercises

  1. The final plot shows a lot of short-term noise in the overall trend. How could you smooth this further to focus on long-term changes?

Instead of using geom_line, we can draw a loess curve using geom_smooth:

txhousing
## # A tibble: 8,602 x 9
##       city  year month sales   volume median listings inventory     date
##      <chr> <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl>    <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000.000
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.083
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.167
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.250
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.333
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.417
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.500
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2000.583
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2000.667
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2000.750
## # ... with 8,592 more rows
# de-season function:
deseas <- function(x, month) {
  resid(lm(x ~ factor(month), na.action = na.exclude))
}

txhousing <- txhousing %>%
  group_by(city) %>%
  mutate(rel_sales = deseas(log(sales), month))

ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_smooth(method = "loess", se = FALSE)
## Warning: Removed 568 rows containing non-finite values (stat_smooth).
## Warning: Removed 430 rows containing missing values (geom_path).

  1. If you look closely (e.g. + xlim(2008,2012)) at the long-term trend you’ll notice a weird pattern in 2009-2011. It looks like there was a big dip in 2010. Is this dip “real”? (i.e. can you spot it in the original data)

Looking at the relative sales (after taking out the month effect), a number of cities do indeed tend to have a dip in 2010:

ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(stat = "summary", fun.y = "mean", color = "red") +
  xlim(2008, 2012)
## Warning: Removed 6377 rows containing non-finite values (stat_summary).
## Warning: Removed 6376 rows containing missing values (geom_path).

TODO: Maybe there are more missing data points in 2010? Check.
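One quick check (a sketch): count the missing sales values per year and see whether 2010 stands out.

```r
txhousing %>%
  group_by(year) %>%
  summarize(
    n_obs = n(),
    n_missing = sum(is.na(sales))  # city-months with no reported sales
  )
```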

sales2010 <- txhousing %>% filter(year==2010)
sales2010
## # A tibble: 552 x 10
## # Groups:   city [46]
##       city  year month sales   volume median listings inventory     date
##      <chr> <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl>    <dbl>
##  1 Abilene  2010     1    73  9130783 112200      868       6.4 2010.000
##  2 Abilene  2010     2    93 10372904  98300      830       6.1 2010.083
##  3 Abilene  2010     3   133 16517713 114000      854       6.3 2010.167
##  4 Abilene  2010     4   161 18788002 103600      859       6.3 2010.250
##  5 Abilene  2010     5   200 22804393  99300      914       6.5 2010.333
##  6 Abilene  2010     6   169 23216943 127900      932       6.7 2010.417
##  7 Abilene  2010     7   159 22363123 127300      915       6.6 2010.500
##  8 Abilene  2010     8   144 17504580 122000      936       6.7 2010.583
##  9 Abilene  2010     9   116 15475763 121300      899       6.5 2010.667
## 10 Abilene  2010    10   111 14570529 111900      863       6.4 2010.750
## # ... with 542 more rows, and 1 more variables: rel_sales <dbl>
  1. What other variables in the TX housing data show strong seasonal effects? Does this technique help to remove them?
summary(txhousing)
##      city                year          month            sales       
##  Length:8602        Min.   :2000   Min.   : 1.000   Min.   :   6.0  
##  Class :character   1st Qu.:2003   1st Qu.: 3.000   1st Qu.:  86.0  
##  Mode  :character   Median :2007   Median : 6.000   Median : 169.0  
##                     Mean   :2007   Mean   : 6.406   Mean   : 549.6  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.: 467.0  
##                     Max.   :2015   Max.   :12.000   Max.   :8945.0  
##                                                     NA's   :568     
##      volume              median          listings       inventory     
##  Min.   :8.350e+05   Min.   : 50000   Min.   :    0   Min.   : 0.000  
##  1st Qu.:1.084e+07   1st Qu.:100000   1st Qu.:  682   1st Qu.: 4.900  
##  Median :2.299e+07   Median :123800   Median : 1283   Median : 6.200  
##  Mean   :1.069e+08   Mean   :128131   Mean   : 3217   Mean   : 7.175  
##  3rd Qu.:7.512e+07   3rd Qu.:150000   3rd Qu.: 2954   3rd Qu.: 8.150  
##  Max.   :2.568e+09   Max.   :304200   Max.   :43107   Max.   :55.900  
##  NA's   :568         NA's   :616      NA's   :1424    NA's   :1467    
##       date        rel_sales      
##  Min.   :2000   Min.   :-1.8473  
##  1st Qu.:2004   1st Qu.:-0.1404  
##  Median :2008   Median : 0.0117  
##  Mean   :2008   Mean   : 0.0000  
##  3rd Qu.:2012   3rd Qu.: 0.1531  
##  Max.   :2016   Max.   : 0.9080  
##                 NA's   :568
#"Months inventory": amount of time it would take to sell all current listings at current pace of sales. 
ggplot(txhousing, aes(date,log(inventory))) + 
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 603 rows containing missing values (geom_path).

#Total active listings
ggplot(txhousing, aes(date,log(listings))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 518 rows containing missing values (geom_path).

#Total value of sales
ggplot(txhousing, aes(date,log(volume))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 430 rows containing missing values (geom_path).

#Median sale price:
ggplot(txhousing, aes(date,log(median))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 446 rows containing missing values (geom_path).

  1. Not all the cities in this data set have complete time series. Use your dplyr skills to figure out how much data each city is missing. Display the results with a visualization.
numDates <- txhousing %>% 
  summarize(nDates=n_distinct(date)) %>% 
  select(nDates) %>% 
  max()
numDates
## [1] 187
cityCnts <- txhousing %>%
  na.omit() %>%
  group_by(city) %>%
  summarize(nComplete=n()) %>%
  mutate(pctComplete=nComplete/numDates)
cityCnts
## # A tibble: 46 x 3
##                     city nComplete pctComplete
##                    <chr>     <int>       <dbl>
##  1               Abilene       186   0.9946524
##  2              Amarillo       182   0.9732620
##  3             Arlington       186   0.9946524
##  4                Austin       187   1.0000000
##  5              Bay Area       186   0.9946524
##  6              Beaumont       187   1.0000000
##  7       Brazoria County       129   0.6898396
##  8           Brownsville        82   0.4385027
##  9 Bryan-College Station       187   1.0000000
## 10         Collin County       186   0.9946524
## # ... with 36 more rows
ggplot(cityCnts, aes(city,pctComplete)) + 
  geom_col() +
  labs(x="City", y="% Complete") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  1. Replicate the computation that stat_summary() did with dplyr so you can plot the data “by hand”.

Comment: I think this question is referring to the geom_line shown on page 229 using stat = "summary" and fun.y = "mean".

meanRelSalesData <- txhousing %>%
  group_by(date) %>%
  summarize(meanRelSales = mean(rel_sales, na.rm=TRUE))
meanRelSalesData
## # A tibble: 187 x 2
##        date meanRelSales
##       <dbl>        <dbl>
##  1 2000.000   -0.2625059
##  2 2000.083   -0.1535309
##  3 2000.167   -0.1775694
##  4 2000.250   -0.3273133
##  5 2000.333   -0.2263179
##  6 2000.417   -0.2185162
##  7 2000.500   -0.3047979
##  8 2000.583   -0.1751718
##  9 2000.667   -0.2235483
## 10 2000.750   -0.2263248
## # ... with 177 more rows
ggplot(txhousing, aes(date,rel_sales)) + 
  geom_line(aes(group=city), alpha = 1/5) +
  geom_line(aes(date,meanRelSales),data=meanRelSalesData,color="red")
## Warning: Removed 430 rows containing missing values (geom_path).

11.5.1 Exercises

  1. Do your conclusions change if you use a different measurement of model fit like AIC or deviance? Why/ why not?
#install.packages("broom")
library(broom)
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
## 
## # A tibble: 46 x 2
##                     city      mod
##  *                 <chr>   <list>
##  1               Abilene <S3: lm>
##  2              Amarillo <S3: lm>
##  3             Arlington <S3: lm>
##  4                Austin <S3: lm>
##  5              Bay Area <S3: lm>
##  6              Beaumont <S3: lm>
##  7       Brazoria County <S3: lm>
##  8           Brownsville <S3: lm>
##  9 Bryan-College Station <S3: lm>
## 10         Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.5298893     0.5003394 0.2817299 17.932066
##  2              Amarillo 0.4493699     0.4147588 0.3015624 12.983427
##  3             Arlington 0.5132337     0.4826369 0.2673618 16.774130
##  4                Austin 0.4873283     0.4551032 0.3102974 15.122641
##  5              Bay Area 0.5552242     0.5272669 0.2646016 19.859696
##  6              Beaumont 0.4304326     0.3946312 0.2754690 12.022792
##  7       Brazoria County 0.5082062     0.4746054 0.2919275 15.124817
##  8           Brownsville 0.1714455     0.1187629 0.4196615  3.254307
##  9 Bryan-College Station 0.6511249     0.6291956 0.4061146 29.692015
## 10         Collin County 0.6009165     0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(AIC, reorder(city,AIC))) +
  geom_point()

Looking at AIC: Harlingen, San Marcos, and Brownsville have the highest AIC (two of the three are the same cities that had the lowest \(R^2\)). Dallas, El Paso, and Midland have the lowest AIC, which differ from the cities with the highest \(R^2\). (Recall that a lower AIC is better.)

worstAIC <- c("Harlingen", "San Marcos", "Brownsville")
bestAIC <- c("Dallas", "El Paso", "Midland")
extreme <- txhousing %>% ungroup() %>%
  filter(city %in% c(worstAIC,bestAIC), !is.na(sales)) %>%
  mutate(city = factor(city, c(worstAIC, bestAIC)))

ggplot(extreme, aes(month, log(sales))) +
  geom_line(aes(group=year)) +
  facet_wrap(~city)

TODO: Why the change?

  1. One possible hypothesis that explains why McAllen, Harlingen and Brownsville have lower \(R^2\) is that they’re smaller towns so there are fewer sales and more noise. Confirm or refute this hypothesis.

Looking at the plot on page 234, McAllen (low \(R^2\)) and Bryan-College Station (high \(R^2\)) have sales of a similar order of magnitude. This is also apparent in a boxplot of their sales.

txhousing %>% ungroup() %>%
  group_by(city) %>%
  filter(city %in% c("McAllen","Bryan-College Station"), !is.na(sales)) %>%
  summarize(meanSales=mean(sales))
## # A tibble: 2 x 2
##                    city meanSales
##                   <chr>     <dbl>
## 1 Bryan-College Station  186.7433
## 2               McAllen  162.1730
selectCities <- txhousing %>%
  filter(city %in% c("McAllen","Bryan-College Station"), !is.na(sales)) 

ggplot(selectCities, aes(city,sales)) + geom_boxplot()

If we look at all of the extreme cities, however, we see that NE Tarrant County (high \(R^2\)) does have significantly more sales, and that Brownsville and Harlingen (both low \(R^2\)) have fewer sales than the others.

selectCities <- txhousing %>%
  filter(city %in% c("McAllen","Bryan-College Station", "Lubbock", "NE Tarrant County", "Brownsville", "Harlingen"), !is.na(sales)) 

ggplot(selectCities, aes(city,sales)) + geom_boxplot()

  1. McAllen, Harlingen and Brownsville seem to have much more year-to-year variation than Bryan-College Station, Lubbock, and NE Tarrant County. How does the model change if you also include a linear trend for year? (i.e. log(sales) ~ factor(month) + year).
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month) + year, data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
## 
## # A tibble: 46 x 2
##                     city      mod
##  *                 <chr>   <list>
##  1               Abilene <S3: lm>
##  2              Amarillo <S3: lm>
##  3             Arlington <S3: lm>
##  4                Austin <S3: lm>
##  5              Bay Area <S3: lm>
##  6              Beaumont <S3: lm>
##  7       Brazoria County <S3: lm>
##  8           Brownsville <S3: lm>
##  9 Bryan-College Station <S3: lm>
## 10         Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.6843255     0.6625548 0.2315246 31.433387
##  2              Amarillo 0.5313000     0.4989758 0.2790224 16.436631
##  3             Arlington 0.5773500     0.5482018 0.2498469 19.807350
##  4                Austin 0.6540551     0.6301968 0.2556268 27.414184
##  5              Bay Area 0.6608046     0.6374118 0.2317349 28.248216
##  6              Beaumont 0.5200054     0.4869023 0.2536079 15.708670
##  7       Brazoria County 0.5091646     0.4723519 0.2925529 13.831237
##  8           Brownsville 0.2355866     0.1822555 0.4042607  4.417429
##  9 Bryan-College Station 0.8510313     0.8407576 0.2661372 82.835859
## 10         Collin County 0.6889276     0.6674743 0.2353747 32.112942
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(r.squared, reorder(city,r.squared))) +
  geom_point()

Harlingen moved up 10 places (comparing the \(R^2\) values) by adding year to the model.
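To make the rank movement explicit, we can join the \(R^2\) values from the two fits (a sketch: it re-fits the month-only model and compares it against `model_sum` from the month + year fit above):

```r
month_only <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude)) %>%
  glance(mod) %>%
  ungroup() %>%
  select(city, r2_month = r.squared)

model_sum %>%
  ungroup() %>%
  select(city, r2_year = r.squared) %>%
  inner_join(month_only, by = "city") %>%
  # positive rank_change = city moved up after adding year to the model
  mutate(rank_change = min_rank(r2_year) - min_rank(r2_month)) %>%
  arrange(desc(rank_change))
```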

  1. Create a faceted plot that shows the seasonal patterns for all cities. Order the facets by the \(R^2\) for the city.
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))

model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.5298893     0.5003394 0.2817299 17.932066
##  2              Amarillo 0.4493699     0.4147588 0.3015624 12.983427
##  3             Arlington 0.5132337     0.4826369 0.2673618 16.774130
##  4                Austin 0.4873283     0.4551032 0.3102974 15.122641
##  5              Bay Area 0.5552242     0.5272669 0.2646016 19.859696
##  6              Beaumont 0.4304326     0.3946312 0.2754690 12.022792
##  7       Brazoria County 0.5082062     0.4746054 0.2919275 15.124817
##  8           Brownsville 0.1714455     0.1187629 0.4196615  3.254307
##  9 Bryan-College Station 0.6511249     0.6291956 0.4061146 29.692015
## 10         Collin County 0.6009165     0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(txhousing, aes(month, log(sales))) +
  geom_line(aes(group=year)) +
  facet_wrap(~city)

TODO: How to order facets by \(R^2\)? Maybe join the datasets and then use a reorder statement.
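One way to do this (a sketch): join the \(R^2\) values from `model_sum` back onto `txhousing` and reorder the `city` factor before faceting.

```r
txhousing_r2 <- txhousing %>%
  inner_join(
    model_sum %>% ungroup() %>% select(city, r.squared),
    by = "city"
  ) %>%
  ungroup() %>%
  mutate(city = reorder(city, -r.squared))  # highest R^2 first

ggplot(txhousing_r2, aes(month, log(sales))) +
  geom_line(aes(group = year)) +
  facet_wrap(~city)
```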

11.6.1 Exercises

  1. Pull out the three cities with the highest and lowest seasonal effect. Plot their coefficients.
coef <- models %>% tidy(mod)
coef
## # A tibble: 552 x 6
## # Groups:   city [46]
##       city            term  estimate  std.error statistic       p.value
##      <chr>           <chr>     <dbl>      <dbl>     <dbl>         <dbl>
##  1 Abilene     (Intercept) 6.5420182 0.07043248 92.883541 7.898616e-151
##  2 Abilene  factor(month)2 0.3537962 0.09960657  3.551937  4.914925e-04
##  3 Abilene  factor(month)3 0.6747930 0.09960657  6.774584  1.826729e-10
##  4 Abilene  factor(month)4 0.7489369 0.09960657  7.518950  2.758003e-12
##  5 Abilene  factor(month)5 0.9164846 0.09960657  9.201046  1.056158e-16
##  6 Abilene  factor(month)6 1.0024412 0.09960657 10.064007  4.372570e-19
##  7 Abilene  factor(month)7 0.9539842 0.09960657  9.577523  9.810940e-18
##  8 Abilene  factor(month)8 0.9337577 0.10125307  9.222019  9.259347e-17
##  9 Abilene  factor(month)9 0.6036668 0.10125307  5.961961  1.342427e-08
## 10 Abilene factor(month)10 0.6084391 0.10125307  6.009093  1.055654e-08
## # ... with 542 more rows
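A sketch of one approach, using the coefficient range (maximum minus minimum monthly coefficient) as the measure of seasonal effect; this is an assumption, since "seasonal effect" could be measured other ways:

```r
seasonal <- coef %>%
  filter(term != "(Intercept)") %>%
  group_by(city) %>%
  summarize(effect = max(estimate) - min(estimate))

extremes <- seasonal %>%
  arrange(effect) %>%
  slice(c(1:3, (n() - 2):n()))  # 3 lowest and 3 highest seasonal effects

coef %>%
  filter(city %in% extremes$city, term != "(Intercept)") %>%
  mutate(month = as.integer(gsub("factor\\(month\\)", "", term))) %>%
  ggplot(aes(month, estimate, color = city)) +
  geom_line()
```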
  1. How does strength of seasonal effect relate to the \(R^2\) for the model? Answer with a plot.
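A sketch, again using the coefficient range as the strength of the seasonal effect, joined against the month-only `model_sum` from above:

```r
coef %>%
  filter(term != "(Intercept)") %>%
  group_by(city) %>%
  summarize(effect = max(estimate) - min(estimate)) %>%
  inner_join(model_sum %>% ungroup() %>% select(city, r.squared), by = "city") %>%
  ggplot(aes(effect, r.squared)) +
  geom_point() +
  geom_smooth(se = FALSE)
```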

  2. You should be extra cautious when your results agree with your prior beliefs. How can you confirm or refute my hypothesis about the causes of strong seasonal patterns?

11.7.1 Exercises

  1. A common diagnostic plot is fitted values (.fitted) vs. residuals (.resid). Do you see any patterns? What if you include the city or month on the same plot?
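A sketch using `broom::augment()` on the per-city models from above (assumes `models` holds the month-only fits):

```r
obs_sum <- models %>% augment(mod)

ggplot(obs_sum, aes(.fitted, .resid)) +
  geom_point(alpha = 1/10) +
  geom_hline(yintercept = 0, color = "red")
```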

  2. Create a time series of log(sales) for each city. Highlight points that have a standardized residual of greater than 2.
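A sketch: compute standardized residuals per city with `rstandard()` (with `na.action = na.exclude` the missing rows should come back padded with NA), then overlay the flagged points in red.

```r
txhousing_std <- txhousing %>%
  group_by(city) %>%
  mutate(std_resid = rstandard(
    lm(log(sales) ~ factor(month), na.action = na.exclude)
  ))

ggplot(txhousing_std, aes(date, log(sales))) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_point(
    data = filter(txhousing_std, abs(std_resid) > 2),
    color = "red", size = 0.5
  )
```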